Abstract

We study the problem of determining the Hamiltonian of a fully connected Ising 
Spin Glass of N units from a set of measurements, whose size needs to be $O(N^2)$
bits. The student-teacher scenario, used to study learning in feed-forward neural 
networks, is here extended to spin systems with arbitrary couplings. The set of mea- 
surements consists of data about the local minima of the rugged energy landscape. 
We compare simulations and analytical approximations for the resulting learning 
curves obtained by using different algorithms.

1 Introduction



The study of the dynamics or statistical properties of a system usually consists 
in making predictions of its behavior based on assumed microscopic laws such 
as, for example, using knowledge about its Hamiltonian. However, ill-posed
and inverse problems can be found in a vast array of areas. Typically, in
these cases, the problem is not to find the behavior, but rather to obtain the
microscopic laws that gave rise to it. Among the many interesting questions
that can be asked, we point out those about the structure of the law, its 
uniqueness and how well it can be determined based on partial information. 

Inverse problems of different degrees of difficulty that have been the subject of
intense recent research activity include rule extraction and learning in artificial
systems; pattern recognition, clustering and categorization problems; finding
the sequence of amino acids that leads to a predetermined chemical
activity; obtaining the parameters of a dynamical system from the time
series it generates; and obtaining renormalized Hamiltonians from Monte Carlo
Renormalization Group data.

In dealing with these problems, techniques from statistics, combinatorial op- 
timization, statistical mechanics, dynamical theory and other areas have been 
found useful to different degrees. In this paper, we deal with the problem of de- 
termining the Hamiltonian of a spin glass from data about its metastable states 
(MS), that is, learning a spin glass. Related issues have recently been addressed
by Kanter and Gotesdyner [1]. In particular, they "show(ed) that static properties
determine the dynamics for a large class of systems" and asked whether
"Classical spin systems with the same MS have the same Hamiltonians". The
affirmative answer, for a large class of systems, immediately raises the following
question: how hard is it to determine the Hamiltonian from partial
information about the MS? In trying to answer this, we use ideas from learning 
in neural networks [2]. 



2 On-line learning in a Spin Glass System 

A set of MS is used as a learning set in the student-teacher scenario. The
teacher is the original classical fully connected spin system, with Hamiltonian
$\mathcal{H} \propto -\sum_{i \neq j} B_{ij} S_i S_j$. The student, another system of similar structure with
$\mathcal{H}_S \propto -\sum_{i \neq j} J_{ij} S_i S_j$, is our approximation to the teacher, constructed from the
MS data.

Here we specialize to the case of an Ising Spin Glass teacher whose couplings
are drawn independently from the distribution $P(B_{ij}) = \frac{1}{2}\delta(B_{ij} - 1) + \frac{1}{2}\delta(B_{ij} + 1)$.
The self-couplings $B_{ii}$ are set to zero. The training set is generated by
letting a randomly chosen initial teacher configuration $\mathbf{S}_0$ relax to the nearest
local minimum $\mathbf{S}^* = \mathbf{S}(t \to \infty)$ according to a zero-temperature aligning-field
dynamics,




$$S_i(t+1) = \mathrm{sgn}\Big(\sum_{j} B_{ij} S_j(t)\Big). \tag{1}$$
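The relaxation to a metastable state can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes symmetric ±1 couplings, asynchronous (one spin at a time) updates in random order, and it leaves a spin unchanged when its local field vanishes. All function names are hypothetical.

```python
import numpy as np

def random_couplings(n, rng):
    """Symmetric +/-1 teacher couplings B with zero self-couplings."""
    upper = np.triu(rng.choice([-1, 1], size=(n, n)), k=1)
    return upper + upper.T

def relax(B, s0, rng, max_sweeps=1000):
    """Zero-temperature dynamics: flip every spin that opposes its local field."""
    s = s0.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            if (B[i] @ s) * s[i] < 0:   # spin anti-aligned with its field
                s[i] = -s[i]
                changed = True
        if not changed:                 # no spin wants to flip: metastable state
            return s
    raise RuntimeError("no fixed point reached within max_sweeps")

def is_metastable(B, s):
    """True if no single spin flip lowers the energy."""
    return bool(np.all((B @ s) * s >= 0))
```

Because the couplings are symmetric with zero diagonal, each accepted flip lowers the energy, so the sequential sweep must stop at a local minimum.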






Both teacher and student spin systems are equivalent to N fully connected
perceptrons. The i-th student perceptron learns from a set of $\nu = 1, \ldots, p = \alpha N$
examples, $\mathcal{L}_i = \{(\mathbf{S}_i^{\nu}, \sigma_i^{\nu})\}$. The input vector

$$\mathbf{S}_i = (S^*_1, \ldots, S^*_{i-1}, 0, S^*_{i+1}, \ldots, S^*_N) \tag{2}$$



for site i is obtained from the metastable configuration by setting the
i-th component to zero; that component is the desired output $\sigma_i = S^*_i$. The task of
the learning process is to build a student with the same energy landscape as
the teacher system or, equivalently, to estimate the Hamiltonian parameters
$\{B_{ij}\}$. We show that this is possible using only the information contained in a small
set of MS ($O(N)$) in comparison to the exponential number of spin-glass local
minima ($\exp(0.19N)$ [1,3]).

In the on-line learning [4,5] strategy, each example is presented only once,
inducing a change in the synaptic weights in the following way,

$$J_{ij}(\nu+1) = J_{ij}(\nu) + \frac{1}{\sqrt{N}} F_i(\nu)\, \sigma_i^{\nu} S_j^{*\nu} \quad (\text{for } i \neq j). \tag{3}$$

The function $F_i(\nu)$ modulates the Hebb term $\sigma_i^{\nu} S_j^{*\nu}$ and characterizes different
learning algorithms. The self-couplings are always set to $J_{ii} = 0$.
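One presentation of a metastable state to all N student perceptrons can be sketched as follows. This is a hedged illustration: the step size $1/\sqrt{N}$, the convention $\Theta(0) = 1$ (so that a student starting from zero couplings makes its first update), and the function names are assumptions made here, not taken from the text.

```python
import numpy as np

def hebb(sigma, h):
    """Simple Hebb rule: F = 1 for every example."""
    return 1.0

def perceptron(sigma, h):
    """Rosenblatt perceptron: F = Theta(-sigma*h); Theta(0) = 1 is assumed."""
    return 1.0 if sigma * h <= 0 else 0.0

def online_step(J, s_star, F):
    """Update every row of the student matrix J with one metastable state."""
    n = len(s_star)
    for i in range(n):
        x = s_star.astype(float)
        x[i] = 0.0                     # input vector: i-th component set to zero
        sigma = s_star[i]              # desired output for site i
        q = J[i] @ J[i] / n            # squared norm Q_i
        h = J[i] @ x / np.sqrt(q * n) if q > 0 else 0.0
        J[i] += F(sigma, h) * sigma * x / np.sqrt(n)   # assumed 1/sqrt(N) step
    return J
```

Since the i-th input component is zero, the self-coupling $J_{ii}$ stays at zero without any extra bookkeeping.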

The macroscopic description of the learning process involves the quantities
$Q_i = \sum_j^N J_{ij} J_{ij}/N$ (the squared norm of the i-th student perceptron) and
$M_i = \sum_j^N B_{ij} B_{ij}/N$ (the corresponding teacher squared norm). We also define the
normalized teacher-student overlap

$$\rho = \frac{1}{N} \sum_{i}^{N} \frac{1}{N} \sum_{j}^{N} \frac{B_{ij} J_{ij}}{\sqrt{M_i Q_i}}, \tag{4}$$

which will be our performance measure as a function of the number of examples
$\alpha N$. The normalized teacher and student local fields at site i are
$b_i = \sum_j^N B_{ij} S_j/\sqrt{M_i N}$ and $h_i = \sum_j^N J_{ij} S_j/\sqrt{Q_i N}$.
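The overlap used as performance measure is straightforward to evaluate from the two coupling matrices. A minimal sketch, assuming the site-averaged normalized overlap described above (the function name is illustrative):

```python
import numpy as np

def overlap(B, J):
    """Site-averaged normalized overlap between teacher B and student J."""
    n = B.shape[0]
    M = np.sum(B * B, axis=1) / n      # teacher squared norms M_i
    Q = np.sum(J * J, axis=1) / n      # student squared norms Q_i
    R = np.sum(B * J, axis=1) / n      # unnormalized overlaps per site
    return float(np.mean(R / np.sqrt(M * Q)))
```

The measure is invariant under a positive rescaling of each student row, which is why only the direction of each student perceptron matters for the learning curves.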

In the simpler case of feed-forward networks, the update rule Eq. (3) leads, in 
the thermodynamic limit, to a set of coupled differential equations describing 
the order parameters learning dynamics [5]. In order to proceed and obtain 
similar equations we need to make several approximations which will even- 
tually be checked by simulations. First, we ignore the correlation among the 
minima and thereby do not take into account the effects that different choices 
of the particular sequence of MS will have. Second, we assume self-averaging 
of the order parameters and site-symmetric evolution ($\rho_i(\alpha) = \rho(\alpha)$). Finally,
we assume that the relevant features of $P(\mathbf{S})$ can be incorporated into
an uncorrelated teacher local field distribution $P(\{b_i\}) = \prod_i P(b_i)$. This is obviously
not the case in the spin-glass problem. Even if the true teacher
local field distribution $P(b)$ were known, the distribution of local minima
$P(\mathbf{S}^*)$ has special directions related in a complicated way to the vectors $\mathbf{B}_i = \{B_{ij}\}_{j=1,\ldots,N}$.
It will be interesting, however, to compare the performance in
the spin glass problem with the theoretical results for a simple approximation
to the distribution of local fields suggested by Palmer and Pond [6]:
$P(b) = (2s^2)^{-1} |b| \exp(-\frac{1}{2} b^2/s^2)$. Although this was proposed for the
fields of global minima states, it is also a good approximation for local minima
fields [1]. The parameter s is adjustable, and in our simulations we obtained
$s \approx 1.05$.
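Teacher fields with the Palmer-Pond form are easy to sample. The sketch below assumes the density is read as $(2s^2)^{-1}|b|\exp(-b^2/2s^2)$, which is what the stated normalization requires; under that reading $|b|$ follows a Rayleigh law with scale s and the sign is symmetric. The function name is illustrative.

```python
import numpy as np

def sample_palmer_pond(size, s, rng):
    """Draw fields b with density |b| exp(-b^2 / (2 s^2)) / (2 s^2)."""
    magnitude = rng.rayleigh(scale=s, size=size)   # |b| is Rayleigh distributed
    sign = rng.choice([-1.0, 1.0], size=size)      # symmetric sign
    return sign * magnitude
```

Under this assumption the mean magnitude is $\langle |b| \rangle = s\sqrt{\pi/2}$, which gives a quick sanity check on a sample.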

Within these approximations, instead of 2N learning equations (two for each 
site), we need only two: 



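$$\frac{d\rho}{d\alpha} = \frac{1}{\sqrt{Q}} \left\langle F \sigma\, (b - \rho h) \right\rangle - \frac{\rho}{2Q} \left\langle F^2 \right\rangle, \qquad
\frac{dQ}{d\alpha} = 2\sqrt{Q} \left\langle F \sigma h \right\rangle + \left\langle F^2 \right\rangle, \tag{5}$$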



where $\langle \cdots \rangle = \int db\, dh\, (\cdots) P(b,h)$. The joint density $P(b,h)$ can be written
as $P(h|b)P(b)$, with $P(h|b) = [2\pi(1-\rho^2)]^{-1/2} \exp[-\frac{1}{2}(h - \rho b)^2/(1-\rho^2)]$ and
$P(b)$ being the particular teacher field distribution to be studied.

The asymptotic ($\alpha \to \infty$) behavior of the 'order' parameter $\rho$ is a quality
measure for comparing the algorithms. In this limit we write

$$1 - \rho(\alpha) \approx C \alpha^{-2x}, \tag{6}$$



where x is the usual learning exponent considered in the literature. 
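Given a measured learning curve, the exponent and prefactor of the asymptotic law can be estimated by a straight-line fit in log-log coordinates. This is a generic sketch, not necessarily the procedure used for the simulations reported here; it assumes the data lie in the asymptotic regime and that $\rho < 1$ at every point.

```python
import numpy as np

def fit_exponent(alpha, rho):
    """Fit 1 - rho = C * alpha**(-2x) and return (x, C)."""
    slope, intercept = np.polyfit(np.log(alpha), np.log(1.0 - rho), 1)
    return -slope / 2.0, float(np.exp(intercept))
```

In practice the fit window matters: points at small $\alpha$ are not yet asymptotic and bias the estimate of x.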



3 Results 



The internal fields in the spin glass examples are correlated in some unknown 
way. It is interesting to observe the effect of these correlations by comparing 
with the case where the teacher local fields are assumed to be independently 
distributed ($P(\{b_i\}) = \prod_i P(b_i)$) according to

$$P_{(r,s)}(b) = \frac{|b|^r}{Z(r,s)} \exp\left(-\frac{b^2}{2s^2}\right), \tag{7}$$

parametrized by r and s; $Z(r,s)$ ensures normalization. The Palmer-Pond
distribution is obtained by setting r = 1.



The learning equations (5) are exact in the thermodynamic limit if the learning
sets $\mathcal{L}$ are generated by choosing random, independent, identically distributed
examples whose teacher fields obey the above Palmer-Pond-like distribution
(this case will be denoted IID PP). Standard analytical calculations [5] in the
$\rho \to 1$ limit, using the distribution (7), lead us to the following results:



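Simple Hebb rule, F = 1: we obtained x = 1/2, independent of r; the
prefactors as functions of r and s are given by

$$C(0,s) = \frac{\pi}{4s^2}, \qquad C(1,s) = \frac{1}{\pi s^2}, \qquad C(r,s) = \frac{1}{4s^2}\left[\frac{\Gamma\left(\frac{r+1}{2}\right)}{\Gamma\left(\frac{r+2}{2}\right)}\right]^2. \tag{8}$$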
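As a consistency check on the Hebb exponent, the learning curve can be worked out in closed form. The sketch below is an illustration under stated assumptions: independent examples, a step size of $1/\sqrt{N}$ in the update rule, a student starting from zero couplings, and the shorthand $R = \rho\sqrt{Q}$, a symbol introduced here only for convenience.

```latex
\begin{align*}
\frac{dR}{d\alpha} &= \langle |b| \rangle
  &&\Rightarrow\quad R(\alpha) = \alpha \langle |b| \rangle, \\
\frac{dQ}{d\alpha} &= 2 R \langle |b| \rangle + 1
  &&\Rightarrow\quad Q(\alpha) = \alpha^2 \langle |b| \rangle^2 + \alpha, \\
\rho(\alpha) &= \frac{R}{\sqrt{Q}}
  = \left[ 1 + \frac{1}{\alpha \langle |b| \rangle^2} \right]^{-1/2}
  &&\Rightarrow\quad 1 - \rho \simeq \frac{1}{2 \langle |b| \rangle^2}\, \alpha^{-1}.
\end{align*}
```

Under these assumptions $1 - \rho$ decays as $\alpha^{-1}$, that is x = 1/2 for any r, with $C = 1/(2\langle |b| \rangle^2)$. For r = 1 one has $\langle |b| \rangle = s\sqrt{\pi/2}$.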



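Rosenblatt Perceptron algorithm, $F = \Theta(-\sigma h)$: the learning exponent
is x = 1/(3 + r), with the following prefactors:

$$C(0,s) = \frac{1}{2}\left(\frac{2s}{3}\right)^{2/3}, \qquad C(1,s) = \frac{s}{8}\sqrt{\frac{3\pi}{2}}, \qquad
C(r>0,s) = \frac{1}{2}\left[\frac{Z(r,s)\, I(r,0)}{4(r+3)\, I(r,1)\, I(r+1,0)}\right]^{2/(r+3)}, \tag{9}$$

with $Z(r>0,s) = 2 r s^{r+1}\sqrt{2\pi}\, I(r-1,0)$ and $I(r,n) = \int_0^\infty dz\, z^r \int_z^\infty Du\, u^n$;
$Du = du\, \exp(-u^2/2)/\sqrt{2\pi}$ and $\Gamma(x)$ is the Gamma function. All the $I(r,n)$
integrals can be found by using

$$I(r,0) = \frac{2^{r/2}}{(r+1)\sqrt{2\pi}}\, \Gamma\left(\frac{r}{2}+1\right),$$
$$I(r+1,1) = (r+1)\, I(r,0), \tag{10}$$
$$I(r,n+2) = (r+n+1)\, I(r+n,0) + (n+1)\, I(r,n),$$

starting from $I(0,0) = 1/\sqrt{2\pi}$ and $I(0,1) = 1/2$.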
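The integrals $I(r,n)$ can be evaluated numerically from the recursion. The sketch below assumes the definition $I(r,n) = \int_0^\infty dz\, z^r \int_z^\infty Du\, u^n$ with integer $n \ge 0$ and real $r > -1$, and uses closed forms in terms of the Gamma function for n = 0 and n = 1; the function name `I_rn` is illustrative.

```python
import math

def I_rn(r, n):
    """I(r, n) = int_0^inf dz z^r int_z^inf Du u^n, integer n >= 0, real r > -1."""
    if n == 0:
        return 2 ** (r / 2) * math.gamma(r / 2 + 1) / ((r + 1) * math.sqrt(2 * math.pi))
    if n == 1:
        # Equivalent to I(r, 1) = r * I(r - 1, 0), in a form that also covers r = 0
        return 2 ** ((r - 1) / 2) * math.gamma((r + 1) / 2) / math.sqrt(2 * math.pi)
    # Recursion I(r, n) = (r + n - 1) I(r + n - 2, 0) + (n - 1) I(r, n - 2)
    return (r + n - 1) * I_rn(r + n - 2, 0) + (n - 1) * I_rn(r, n - 2)
```

The two starting values quoted in the text, $1/\sqrt{2\pi}$ and 1/2, are recovered by `I_rn(0, 0)` and `I_rn(0, 1)`.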

It is worth noting that, due to the behavior x = 1/(3 + r), the Perceptron
algorithm shows only partial learning in the limit $r \to \infty$ (as indicated by
$x \to 0$), that is, when $P(b)$ goes to zero exponentially as $b \to 0$ (say, as
$\exp(-1/b^2)$). This condition relates to, but is weaker than, the case of distributions
with a gap around b = 0 discussed by Reimann and Van den Broeck
[8]. Another interesting point is that, since in principle r may be any real
non-negative number, the learning exponent x can assume real values which
are not simple fractions. The ubiquity of these fractions in the previous
literature simply reflects the particular choices for the small-b behavior of the
distributions $P(b)$ studied so far.

We compare these analytical results with simulations for examples generated 
by the IID PP case with r = 1 and s = 1.05, and also with simulation 
results for the spin-glass teacher, see Fig. 1. Since the spin glass local minima
distribution has structure and special directions (related in an unknown way
to the matrix $\{B_{ij}\}$), we expect that the results are only approximate.




Fig. 1. Evolution of $\rho$ as a function of $\alpha$. Lower curves are for a spin glass teacher
(N = 399): Hebb (balls), Perceptron (squares) and $F^{\mathrm{opt}}_{(0,1)}$ (dashed). Upper curves
are for the IID PP case (N = 99): Hebb (circles), Perceptron (squares) and
$F^{\mathrm{opt}}_{(0,1)}$ (solid). This last curve is slightly above the Hebb curve. Inset: asymptotic
behavior of $1 - \rho$ for Perceptron (squares) and Hebb (circles) for the IID PP
case. Solid lines are theoretical curves.

Indeed, the simulations for examples distributed exactly according to (7) are
in excellent agreement with the theoretical predictions, see Table 1 and Fig. 1
(inset). But in the simulations with the spin glass teacher, simple Hebb learning
stops at $\rho(\alpha \to \infty) \approx 0.57$ (partial learning). Analogously to the results
of Riegler et al. [7], this can be attributed to the presence of special directions
in $P(\mathbf{S}^*)$ not aligned with $\mathbf{B}_i$.

This lack of robustness is a feature only of the simple Hebb rule. The Perceptron
learning algorithm is robust to the correlations in the examples provided
by the spin glass teacher: perfect learning is achieved in the $\alpha \to \infty$ limit. The
learning exponent x, however, seems to be 1/3 instead of the theoretically expected
value 1/4. This is a finite size effect which cannot be eliminated by
using larger systems, due to the following non-uniform convergence phenomenon.
In the Palmer-Pond distribution, $P(b) \to 0$ as $|b| \to 0$, but in the SG
simulations, $P(b)$ assumes a finite value $P(0) = a$ of order $O(1/\sqrt{N})$ at this
point due to finite size effects (see footnote 5) [6,9]. A better description of the local field
distribution is obtained by the form

$$P(b) \propto \left(a(N) + |b|\right) \exp\left(-\frac{b^2}{2s^2}\right). \tag{11}$$



Repeating the calculation with the above distribution, we found that any finite
parameter a changes the large-$\alpha$ behavior, leading to x = 1/3, the same value
found for the r = 0 distribution. Thus, learning the spin-glass Hamiltonian is
easier for finite N.



Table 1

Learning exponent x. The first two columns were obtained by numerical simulations
for increasing N, extrapolated to $N \to \infty$; the last two, by analytical
calculations using the distribution Eq. (7).

              Spin Glass       IID PP          r = 0    r = 1
HEBB          0.001 ± 0.005    0.52 ± 0.01     1/2      1/2
PERCEPTRON    0.314 ± 0.01     0.245 ± 0.01    1/3      1/4



4 Robustness of the 'Optimal algorithm' 



The optimal performance for on-line learning in the class of distributions ex- 
amined here is given by the prescription [4,5] 



$$F^{\mathrm{opt}}_{(r,s)}(\sigma, h) = \sqrt{Q}\left(\rho^{-1} \langle\langle |b| \rangle\rangle_{b|h,\sigma} - \sigma h\right) = \sqrt{Q}\, \frac{1-\rho^2}{\rho^2}\, \sigma \frac{\partial}{\partial h} \ln \frac{P_{(r,s)}(\sigma,h)}{P_0(h)}, \tag{12}$$

where $\langle\langle \cdots \rangle\rangle_{b|h,\sigma} = \int db\, (\cdots)\, P_{(r,s)}(b|h,\sigma)$, $\sigma = \mathrm{sgn}(b)$ and $P_0(h) = e^{-h^2/2}/\sqrt{2\pi}$.
The distribution $P_{(r,s)}(\sigma,h)$ is obtained from the joint distribution $P_{(r,s)}(b,h) = P(h|b)P_{(r,s)}(b)$
by integration over $|b|$, and the distribution $P_{(r,s)}(b|h,\sigma)$ is equal
to $P_{(r,s)}(b,h)/P_{(r,s)}(\sigma,h)$.

Footnote 5: This finite size effect should not be confused with the finite P(0) found in the
replica symmetric calculation of Roberts [10], which is presumably wrong due to
broken replica symmetry.

Since optimal algorithms are always obtained for specific distributions, they
could suffer from a lack of robustness. This does not seem to be a serious
problem, even in the absence of specific knowledge of, e.g., r and s. We have done
simulations with

$$F^{\mathrm{opt}}_{(0,1)} = \sqrt{Q}\, \frac{\lambda}{\sqrt{2\pi}}\, \frac{e^{-h^2/2\lambda^2}}{H(-\sigma h/\lambda)}, \tag{13}$$

where $\lambda = \sqrt{1-\rho^2}/\rho$ and $H(x) = \int_x^\infty Du$, which is the optimal algorithm for
Gaussian teacher local fields (r = 0) with unit variance [4]. The examples,
however, are generated with the Palmer-Pond distribution with parameters
r = 1, s = 1.05 and with the spin glass teacher. Although not optimal, the
performance of $F^{\mathrm{opt}}_{(0,1)}$ is better than that of the standard algorithms, see Fig. 1.
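The Gaussian-optimal modulation function is cheap to evaluate once the Gaussian tail $H(x)$ is written with the complementary error function. The sketch below assumes the form $\sqrt{Q}\,(\lambda/\sqrt{2\pi})\, e^{-h^2/2\lambda^2}/H(-\sigma h/\lambda)$ with $\lambda = \sqrt{1-\rho^2}/\rho$, which is how Eq. (13) is read here; the function names are illustrative, and in a simulation $\rho$ and $Q$ would have to be supplied or estimated.

```python
import math

def H(x):
    """Gaussian tail H(x) = int_x^inf Du = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def f_opt_gaussian(sigma, h, rho, Q):
    """Modulation function for Gaussian teacher fields with unit variance."""
    lam = math.sqrt(1.0 - rho ** 2) / rho
    gauss = math.exp(-h * h / (2.0 * lam * lam))
    return math.sqrt(Q) * lam / math.sqrt(2.0 * math.pi) * gauss / H(-sigma * h / lam)
```

The function is positive everywhere and larger for misclassified examples ($\sigma h < 0$) than for correctly classified ones at the same $|h|$. For very large negative $\sigma h/\lambda$ the tail underflows, so a production implementation would need an asymptotic expansion there.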

Thus, although derived for specific distributions of examples, optimal algo- 
rithms can be used successfully for other distributions. The robustness of the 
optimal algorithm arises as a very welcome property, since in real world prob- 
lems the examples may be nontrivially distributed in an unknown manner. 



5 Conclusion 



We studied the learning process in neural networks in a scenario midway
between the simple distributions of examples studied so far in the literature
[7,8] and real world problems. The true distribution $P(\mathbf{S})$ of 'examples' (local
minima) generated by the spin glass is unknown, but the teacher system is
still realizable by the student network. We have compared the performance of
standard algorithms in this spin-glass problem with theoretical and simulation
results for examples with a Palmer-Pond distribution $P_{(r,s)}(b)$ for the local
fields.

Various extensions of this scenario can be devised. We may study teachers
with a more structured distribution of local minima, such as Hopfield networks.
We expect these Hamiltonians to be harder to learn since they have fewer
local minima. For example, a ferromagnetic Hamiltonian with only two global
minima is unlearnable because many J vectors are compatible with these two
'examples' [1]. Another interesting extension is to learn from a teacher which
generates examples at a nonzero temperature. This corresponds to learning
from 'noisy examples' with a noise level depending in a nontrivial way on
the temperature T. Since it has been demonstrated that it is possible to learn
perfectly from noisy examples [5,11], we expect that this task is also learnable.

Finally, we think that our work opens an unexplored learning scenario where
the distribution of examples is generated dynamically by the teacher system
but the teacher architecture is still realizable by the student. Another example
could be the learning from examples generated from the attractor time-series 
of a recurrent perceptron [12,13]. It is worthwhile to study these realizable 
cases since they define upper bounds for the performance achievable by neural 
networks. The approach of determining theoretical upper bounds for the effi- 
ciency of simple (thermal or computational) machines follows a long tradition 
in Thermodynamics and Statistical Physics.